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Network Newton-Part I: Algorithm and Convergence 
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Abstract —We study the problem of minimizing a sum of convex 
objective functions where the components of the objective are 
available at different nodes of a network and nodes are allowed 
to only communicate with their neighbors. The use of distributed 
gradient methods is a common approach to solve this problem. 
Their popularity notwithstanding, these methods exhibit slow 
convergence and a consequent large number of communications 
between nodes to approach the optimal argument because they 
rely on first order information only. This paper proposes the 
network Newton (NN) method as a distributed algorithm that 
incorporates second order information. This is done via distributed 
implementation of approximations of a suitably chosen Newton 
step. The approximations are obtained by truncation of the Newton 
step’s Taylor expansion. This leads to a family of methods defined 
by the number K of Taylor series terms kept in the approximation. 
When keeping K terms of the Taylor series, the method is 
called NN-A' and can be Implemented through the aggregation of 
information in fy-hop neighborhoods. Convergence to a point close 
to the optimal argnment at a rate that is at least linear is proven 
and the existence of a tradeoff between convergence time and 
the distance to the optimal argument is shown. Convergence rate, 
several practical implementation matters, and numerical analyses 
are presented in a companion paper |^. 

Index Terms —Multi-agent network, distributed optimization, 
Newton’s method. 


I. Introduction 

Distributed optimization algorithms are used to solve the 
problem of minimizing a global cost function over a set of nodes 
in situations where the objective function is dehned as a sum of 
local functions. To be more precise, consider a variable x S 
and a connected network containing n agents each of which has 
access to a local function K. The agents cooperate 

in minimizing the aggregate cost function / : —>■ M taking 

values /(x) := agents cooperate in solving 

the global optimization problem 

n 

X* := argmin/(x) = argmin^/i(x) . (1) 

X X . - 

1—1 

Problems of this form arise often in, e.g., decentralized control 
systems 14|-0, wireless systems 0-0, sensor networks Gg- 
03^ and large scale machine learning Gl-ITg. In the latter 
case, distributed formulations are efficient in dealing with very 
large datasets where it is desirable to split training sets into 
smaller subsamples that are assigned to different servers 
In this paper we assume that the local costs fi are twice 
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differentiable and strongly convex. Therefore, the aggregate cost 
function / is also twice differentiable and strongly convex. 

There are different algorithms to solve Q in a distributed 
manner. The most popular choices are decentralized gradient 
descent (DGD) flTl-pOl, distributed implementations of the 
alternating direction method of multipliers |[T0| , and 

decentralized dual averaging (DDA) p4| , |25[ . Although there 
are substantial differences between them, these methods can be 
genetically abstracted as combinations of local descent steps 
followed by variable exchanges and averaging of information 
among neighbors. A feature common to all of these algorithms 
is the slow convergence rate in ill-conditioned problems since 
they operate on hrst order information only. This is not surpris¬ 
ing because gradient descent methods in centralized settings 
where the aggregate function gradient is available at a single 
server have the same difficulties in problems with skewed 
curvature [see Chapter 9 of |26|.] 

This issue is addressed in centralized optimization by New¬ 
ton’s method that uses second order information to determine 
a descent direction adapted to the objective’s curvature [see 
Chapter 9 of p6)]. In general, second order methods are not 
available in distributed settings because distributed approxima¬ 
tions of Newton steps are difficult to devise. In the particular 
case of flow optimization problems, these approximations are 
possible when operating in the dual domain 1^-0. As 
would be expected, these methods result in large reductions 
of convergence times. 

Our goal here is to develop approximate Newton’s methods to 
solve 0 in distributed settings where agents have access to their 
local functions only and exchange variables with neighboring 
agents. We do so by introducing Network Newton (NN), a 
method that relies on distributed approximations of Newton 
steps for the global cost function / to accelerate convergence 
of DGD. We begin the paper with an alternative formulation 
of 0 and a brief discussion of DGD (Section |IJ. We then 
introduce a reinterpretation of DGD as an algorithm that utilizes 
gradient descent to solve a penalized version of 0 in lieu of 
the original optimization problem (Section |II-A[ ). This reinter¬ 
pretation explains convergence of DGD to a neighborhood of 
the optimal solution. The volume of this neighborhood is given 
by the relative weight of the penalty function and the original 
objective which is controlled by a penalty coefficient. 

If gradient descent on the penalized function finds an approx¬ 
imate solution to the original problem, the same solution can 
be found with a much smaller number of iterations by using 
Newton’s method. Alas, distributed computation of Newton 
steps requires global communication between all nodes in the 
network and is therefore impractical (Section 0- To resolve 
this issue we approximate the Newton step of the penalized 
objective function by truncating the Taylor series expansion 
of the exact Newton step (Section III-A|i. This results in a 
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family of methods indexed by the number of terms of the 
Taylor expansion that are kept in the approximation. The method 
that results from keeping K of these terms is termed NN- 
K. A fundamental observation here is that the Hessian of the 
penalized function has a sparsity structure that is the same 
sparsity pattern of the graph. Thus, when computing terms in 
the Hessian inverse expansion, the first order term is as sparse 
as the graph, the second term is as sparse as the two hop 
neighborhood, and, in general, the fc-th term is as sparse as the 
/c-hop neighborhood of the graph. Thus, implementation of the 
NN-iT method requires aggregating information from K hops 
away. Increasing K makes NN-/f arbitrarily close to Newton’s 
method at the cost of increasing the communication overhead 
of each iteration. 

Convergence of NN-iT to the optimal argument of the 
penalized objective is established (Section 23 . We do so by 
establishing several auxiliary bounds on the eigenvalues of the 
matrices involved in the definition of the method (Propositions 
and Lemma 1^. Of particular note, we show that a measure 
of the error between the Hessian inverse approximation utilized 
by NN-itT and the actual inverse Hessian decays exponentially 
with the method index K. This exponential decrease hints that 
using a small value of K should suffice in practice. Convergence 
is formally claimed in Theorem that shows the convergence 
rate is at least linear. It follows from this convergence analysis 
that larger penalty coefficients result in faster convergence 
that comes at the cost of increasing the distance between the 
optimal solutions of the original and penalized objectives. The 
convergence guarantees established in this paper are not better 
than the corresponding guarantees for DGD. These advantages 
are established in a companion paper where we further show that 
the sequence of penalized objective function values generated 
by NN-iT has a convergence rate that is quadratic in a specific 
interval. This quadratic phase holds for all K and can be 
made arbitrarily large by increasing K Q. Numerical results 
in establish the advantages of NN-iT in terms of number 
of iterations and communications steps relative to DGD and 
establish that using K = 1 or K = 2 tends to work best in 
practice. 

Notation. Vectors are written as x G K" and matrices as 
A G Given n vectors x^, the vector y = [xi;... ;x„] 

represents a stacking of the elements of each individual x^. 
The null space of matrix A is denoted by null (A) and the span 
of a vector by span(x). We use |jx|| to denote the Euclidean 
norm of vector x and ||A|| to denote the Euclidean norm of 
matrix A. The gradient of a function /(x) is denoted as V/(x) 
and the Hessian matrix is denoted as V^/(x). The i-th largest 
eigenvalue of matrix A is denoted by /ii(A). 

H. Distributed Gradient Descent 

The network that connects the n agents is assumed connected, 
symmetric, and specihed by the neighborhoods JVi that contain 
the list of nodes than can communicate with i for i = 1,... ,n. 
In the problem in (0 agent i has access to the local objective 
function /i(x) and agents cooperate to minimize the global 
cost /(x). This specification is more naturally formulated by 
an alternative representation of 0 in which node i selects 


a local decision vector x^ G K^. Nodes then try to achieve 
the minimum of their local objective functions /i(xi), while 
keeping their variables equal to the variables Xj of neighboring 
nodes j G A/i. This alternative formulation can be written as 

n 

{xnr=i := argmin ^ /*(x,), 

s.t. Xj = Xj, for all i,j G Ni- (2) 

Since the network is connected, the constraints x^ = Xj for all 
i and j G Ni imply that Q and (|^ are equivalent in the sense 
that we have x* = x* for all i. This must be the case because 
for a connected network the constraints Xi = xj for all i and 
j G N collapse the feasible space of 0 to a hyperplane in 
which all local variables are equal. When all local variables are 
equal, the objectives in ([T]) and (|^ coincide and, in particular, 
so do their optima. 

DGD is an established distributed method to solve (|^ which 
relies on the introduction of nonnegative weights Wij > 0 that 
are not null if and only if j = i or if j G N- Letting f G N be 
a discrete time index and a a given stepsize, DGD is dehned 
by the recursion 

n 

Xi^t+i = '^WijXj^t - (3) 

Since Wij = 0 when j i and j ^ N, it follows from 0 that 
each agent i updates its estimate x^ of the optimal vector x* by 
performing an average over the estimates x^ t of its neighbors 
j G N and its own estimate x^ t, and descending through 
the negative local gradient —Vfi{xi^t)- DGD is a distributed 
method because to implement ([^, node i exchanges variables 
with neighboring nodes only. 

The weights in ([^ cannot be arbitrary. To express conditions 
on the set of allowable weights define the matrix W G 
with entries Wij. We require the weights to be symmetric, i.e., 
Wij = Wji for all i,j, and such that the weights of a given node 
sum up to 1, i.e., 3J=i *■ *^he weights sum up 

to 1 we must have W1 = 1 which implies that I — W is rank 
dehcient. It is also customary to require the rank of I — W to 
be exactly equal to n — 1 so that the null space of I — W is 
null(I — W) = span(l). We therefore have the following three 
restrictions on the matrix W, 

W^ = W, Wl = l, null(I-W) = span(l). (4) 

If the considions in 0 are true, it is possible to show that 
0 approaches the solution of 0 in the sense that x^ t « x* 
for all i and large t, m- The accepted interpretation of why 
0 converges is that nodes are gradient descending towards 
their local minima because of the term —aNfi{xi t) but also 
perform an average of neighboring variables 3j=i This 

latter consensus operation drives the agents to agreement. In 
the following section we show that 0 can be alternatively 
interpreted as a penalty method. 

A. Penalty method interpretation 

It is illuminating to dehne matrices and vectors so as to 
rewrite (|^ as a single equation. To do so dehne the vectors 
y := [xi;...;x„] and h(y) := [V/i(xi);...; V/„(x„)]. 
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Vector y S concatenates the local vectors x^, and the vector 
h(y) G concatenates the gradients of the local functions 
fi taken with respect to the local variable x^. Notice that h(y) 
is not the gradient of /(x) and that a vector y with h(y) = 0 
does not necessarily solve Q. To solve Q we need to have 
Xi = Xj for all i and j with ^= 0 - In event, 

to rewrite (|^ we also dehne the matrix Z := W® I S 
as the Kronecker product of the weight matrix W G and 

the identity matrix I G It is then ready to see that Q is 

equivalent to 

yt+i = Zyt - ah(yt) = y* - [(I - Z)yt + ah(yt)], (5) 

where in the second equality we added and subtracted yt and 
regrouped terms. Inspection of (|^ reveals that the DGD update 
formula at step t is equivalent to a (regular) gradient descent 
algorithm being used to solve the program 

1 ” 

y* :=argmin F{y) := min - y^(I-Z) y+a ^/,(x,). ( 6 ) 

i=l 

Indeed, given the dehnition of the function F{y) := 
(l/2)y^(I — Z) y + a follows that the gradient 

of F{y) at y = yt is given by 

gt := VF(yt) = (I-Z)yt+ ah(yt). (7) 

Using o we rewrite Q as yt+i = yt — St conclude that 
DGD descends along the negative gradient of F{y) with unit 
stepsize. The expression in (|^ is just a distributed implementa¬ 
tion of gradient descent that uses the gradient in Q. To confirm 
that this is true, observe that the zth element of the gradient 
St = [Si,t;---':St,t] is given by 

Si,t = (1 - Wii)y^%,t - ^ WijXj^t + aVf^{xi,t). ( 8 ) 

j&Mi 

The gradient descent iteration yt+i = yt — St is then equivalent 
to ID if we entrust node i with the implementation of the 
descent x^ 4+1 = x^ j — g^ j, where, we recall, x^ j and x^ 4+1 
are the *th components of the vectors yt and yt+i- Observe that 
the local gradient component g^ t can be computed using local 
information and the x^ ( iterates of its neighbors j G Nt- This 
is as it should be, because the descent x^ = x^ ( — g^ t is 
equivalent to 0 . 

Is it a good idea to descend on F{y) to solve ([T])? To some 
extent. Since we know that the null space of I — W is null(I — 
W) = span(l) and that Z = W ® I we know that the span 
of I — Z is null(I — 7j) — span(l ® I). Thus, we have that 
(I — Z)y = 0 holds if and only if xi = • • • = x„. Since the 
matrix I — Z is positive semidefinite - because it is stochastic 
and symmetric -, the same is true of the square root matrix 
(I — Z)^/^. Therefore, we have that the optimization problem 
in (|^ is equivalent to the optimization problem 

n 

y* := argmin ^ M^i), 

S.t. (I - Z)l/2y = 0. (9) 

Indeed, for y = [xi;...; x„] to be feasible in (|^ we must 
have xi = • • • = x„ because null[(I — Z)^/^] = span(l ® I) 
as already argued. This is the same constraint imposed in (|^ 


from where it follows that we must have y* = [xJ;... ;x*] 
with X* = X* for all i. The unconstrained minimization in 
•HI is a penalty version of ( 0 . The penalty function associated 
with the constraint (I —Z)^/^y = 0 is the squared norm 
( 1 / 2 ) 11(1 — Z)^/^y|p and the corresponding penalty coefficient 
is 1/a. Inasmuch as the penalty coefficient 1/a is sufficiently 
large, the optimal arguments y* and y* are not too far apart. 

The reinterpretation of ([^ as a penalty method demonstrates 
that DGD is an algorithm that hnds the optimal solution of 
( 61 , not (|^ or its equivalent original formulations in Q and 
(2i. Using a fixed a the distance between y* and y* is 
of order 0(a), p^ . To solve © we need to introduce a 
rule to progressively decrease a. In this paper we exploit the 
reinterpretation of as a method to minimize (|^ to propose 
an approximate Newton algorithm that can be implemented in a 
distributed manner. We explain this algorithm in the following 
section. 

III. Network Newton 

Instead of solving with a gradient descent algorithm 
as in DGD, we can solve using Newton’s method. To 
implement Newton’s method we need to compute the Hessian 
Ht := W‘^F{yt) of F evaluated at yt so as to determine the 
Newton step d( := —Hj”^gt. Start by differentiating twice in 
(|^ in order to write Ht as 

lit:=W^F{yt) = I-Z + aGt, ( 10 ) 

where the matrix Gj G is a block diagonal matrix 

formed by blocks Ga^t G containing the Hessian of the 

ith local function, 

= V"/.(x,,t)- ( 11 ) 

It follows from and 0 that the Hessian Ht is block 
sparse with blocks t G having the sparsity pattern 

of Z, which is the sparsity pattern of the graph. The diagonal 
blocks are of the form Ha^t = (1 ~ wu)! + o:V^fi(xt^t) and 
the off diagonal blocks are not null only when j G Ni in which 
case = Wijl. 

While the Hessian Hj is sparse, the inverse Hj is not. It is the 
latter that we need to compute the Newton step d( := H/ ■^gi- 
To overcome this problem we split the diagonal and off diagonal 
blocks of Ht and rely on a Taylor’s expansion of the inverse. 
To be precise, write H* = Dj — B where the matrix Dj is 
dehned as 

D( := aGt + 2 (I - diag(Z)) := aG* + 2 (I - Z^), (12) 

where in the second equality we dehned Z^ := diag(Z) for 
future reference. Since the diagonal weights must be wu < 
1, the matrix I — Z^; is positive dehnite. The same is true of 
the block diagonal matrix G* because the local functions are 
assumed strongly convex. Therefore, the matrix Dt is block 
diagonal and positive dehnite. The zth diagonal block Du^t G 
RP of Dj can be computed and stored by node i as Du^t = 
fi{xi t) + 2(1 — Wu)I. To have H( = Dj — B we must 
dehne B := Dt — Hj. Considering the dehnitions of Hj and 
Dt in ( [Tol l ( [T^ , it follows that 

B = I-2Zd + Z. (13) 
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Observe that B is independent of time and depends on the 
weight matrix Z only. As in the case of the Hessian Hj, the 
matrix B is block sparse with blocks By S having the 

sparsity pattern of Z, which is the sparsity pattern of the graph. 
Node i can compute the diagonal blocks B^^ = (1 — wa)! and 
the off diagonal blocks By = ruyl using the local information 
about its own weights. 

Proceed now to factor D/ from both sides of the splitting 
relationship to write Ht = . 

When we consider the Hessian inverse we can use the 

Taylor series (I-X)-i = Y .%0 with X = 
to write 


H: 


d; 


- 1/2 


oo 

i:(° 


- 1/2 


bd: 


- 1/2 


fc =0 


d: 


- 1/2 


(14) 


Observe that the sum in ([T^ converges if the absolute value of 
all the eigenvalues of the matrix are strictly 

less than 1. For the time being we assume this to be the 
case but we will prove that this is true in Section IV When 
the series converge, we can use truncations of this series to 
dehne approximations to the Newton step as we explain in the 
following section. 


A. Distributed approximations of the Newton step 

Network Newton (NN) is defined as a family of algorithms 
that rely on truncations of the series in ( [T4l l. The K\h member 
of this family, NN-K, considers the first K + 1 terms of the 
series to dehne the approximate Hessian inverse 

(15) 

k^O 

'' (K)~^ 

NN-iT uses the approximate Hessian Hj as a curvature 
correction matrix that is used in lieu of the exact Hessian inverse 
H-i to estimate the Newton step. I.e., instead of descending 
along the Newton step dj := —we descend along 
the NN-iT step := —Hlj^^ gt, which we intend as an 

approximation of d^. Using the explicit expression for ^ 
in CD we write the NN-AT step as 

d[^'> = - D-'/" ^ gt, (16) 

k=0 

where, we recall, the vector gt is the gradient of objective 
function F{y) dehned in Q. The NN-iT update formula can 
then be written as 

Vi-ti = yt + e (17) 


The algorithm dehned by recursive application of CD can 
be implemented in a distributed manner because the truncated 
series in ( fTS] ) has a local structure controlled by the parameter 
K. To explain this statement better dehne the components 




of the NN-iT step = [d![ 


iW. 




distributed implementation of ( [T7| ) requires that node i computes 
d|^^ so as to implement the local descent yt-i^t+i = 


The key observation here is that the step component d^^^ 
can indeed be computed through local operations. Specihcially, 


Algorithm 1 Network Newton- AT method at node i 

Require: Initial iterate Xi,o- Weights Wij. Penalty coefficient a. 
1: B matrix blocks: = (1 — wa)! and By = WijI 

2: for t = 0,1, 2,... do 

3: D matrix block: = aV^/i(xi,t) -|- 2(1 — wu)'l 

4: Exchange iterates with neighbors j € Ni. 

5: Gradient: gi,t = (1 — 'Wu)^i,t — ^ + aV 


jeAfi 

Compute NN-0 descent direction 
for fc = 0,..., A' — 1 do 

Exchange elements of the NN-fc step with neighbors 


NN-(fc -b 1) step: d) 


(k+i) _ ■ 


r)~ 

t — 


end for 

Update local iterate: = x^t -|- e d^ ^ 

end for 


^ Byd'*^ - gi,. 
3 eJ^i, 3 


{K) 


begin by noting that as per the dehnition of the NN-AT descent 
direction in ( [T6| ) the sequence of NN descent directions satishes 

d(fc+i) ^ D,-iBdf) - = D,-' (Bdf ^ - gt) . (18) 

Then observe that since the matrix B has the sparsity pattern 
of the graph, this recursion can be decomposed into local 
components 



The matrix = aV^/i(xy() + 2(1 — wu)! is stored 

and computed at node i. The gradient component g^ t = 
(1 - w^i)K^^t - J2jGNi is also stored and 

computed at i. Node i can also evaluate the values of the matrix 
blocks Bn = (1 — Wii)l and By = WyT. Thus, if the NN-fc 

(k) 

step components / are available at neighboring nodes j, node 
i can then determine the NN-(fc + 1) step component 
upon being communicated that information. 

The expression in ( [T^ represents an iterative computation 
embedded inside the NN-AT recursion in For each time 
index t, we compute the local component of the NN-0 step 
d|°i = —D^^jgi J. Upon exchanging this information with 
neighbors we use ( [T9| l to determine the NN-1 step components 
d/j/ These can be exchanged and plugged in ( [T^ to compute 
d) I. Repeating this procedure AT times, nodes ends up having 
determined their NN-A' step component d^^^ . 

The resulting NN-AT method is summarized in Algorithm 

The descent iteration in ([TD is implemented in Step 11. 
Implementation of this descent requires access to the NN-AT 
descent direction which is computed by the loop in steps 
6-10. Step 6 initializes the loop by computing the NN-0 step 
dj-°^ = —Djj jgi_(. The core of the loop is in Step 9 which 
corresponds to the recursion in GD- Step 8 stands for the 
variable exchange that is necessary to implement Step 9. After 
AT iterations through this loop, the NN-AT descent direction 
dj-^^ is computed and can be used in Step 11. Both, steps 6 and 
9, require access to the local gradient component g, j. This is 
evaluated in Step 5 after receiving the prerequisite information 
from neighbors in Step 4. Steps 1 and 3 compute the blocks 
Bii^t, By.t, and Bu^t that are also necessary in steps 6 and 9. 
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IV. Convergence Analysis 

In this section we show that as time progresses the sequence 
of objective function values F{yt) [cf. approaches the 
optimal objective function value F{y*). In proving this claim 
we make the following assumptions. 

Assumption 1 There exists constants 0 < <5 < A < 1 that 
lower and upper bound the diagonal weights for all i, 

0 < 5 < Wii < A < 1 , i = 1,. .. ,n. (20) 

Assumption 2 The local objective functions fi(pf) are twice 
differentiable and the eigenvalues of the local objective function 
Hessians are bounded with positive constants 0 < m < M < 
oo, i.e. 

ml A VV,(x) A Ml. (21) 

Assumption 3 The local objective function Hessians V^/i(x) 
are Lipschitz continuous with respect to the Euclidian norm with 
parameter L. I.e., for all x, x G it holds 

l|V^/*(x) - V^/*(x)|| < L||x-x||. (22) 

Notice that the lower bound in Assumption [T] is more a 
dehnition than a constraint since we may have (5 = 0. This is not 
recommendable as it is implies that the weight wu assigned to 
the local variable x^ in (|^ is null, but nonetheless allowed. The 
upper bound A < 1 on the weights wu is true for all connected 
networks as long as neighbors j G A/) are assigned nonzero 
weights Wij > 0. This is because the matrix W is doubly 
stochastic [cf. Q], which implies that wa = 1—^ ^ 
as long as Wij > 0 . 

The lower bound m for the eigenvalues of local objective 
function Hessians V^/i(x) is equivalent to the strong convexity 
of local objective functions /; (x) with parameter m. The strong 
convexity assumption for the local objective functions /i(x) 
stated in Assumption is customary in convergence proofs of 
Newton-based methods, since the Hessian of objective function 
should be invertible to establish Newton’s method [Chapter 
9 of |[2^]. The upper bound M for the eigenvalues of local 
objective function Hessians V^/i(x) is similar to the condition 
that gradients V/i(x) are Lipschitz continuous with parameter 
M for the case that functions are twice differentiable. 

The restriction imposed by Assumption 1^ is also typical of 
second order methods pT) . Assumption [3]guarantees that the 
Hessian matrices of objective functions F{y) are also Lipschitz 
continuous as we show in the following lemma. 

Lemma 1 Consider the definition of objective function F{y) in 
If Assumption holds then the objective function Hessian 
H(y) =: '^‘^F(y) is Lipschitz continuous with parameter otL, 
i.e. 

||H(y)-H(y)|l <aL||y-y||. (23) 

for all y, y S 

Proof: See Appendix [A| ■ 

Lemma[T]states that the penalty objective function introduced 
in (|^ has the property that the Hessians are Lipschitz contin¬ 
uous, while the Lipschitz constant is a function of the penalty 


coefficient 1/a. This observation implies that as we increase 
the penalty coefficient 1 /a, or, equivalently, decrease a, the 
objective function F{y) approaches a quadratic form because 
the curvature becomes constant. 

To prove convergence properties of NN we need bounds for 
the eigenvalues of the block diagonal matrix Dj, the block 
sparse matrix B, and the Hessian Hj. These eigenvalue bounds 
are established in the following proposition using the conditions 
imposed by Assumptions [T] and 

Proposition 1 Consider the definitions of matrices H^, Dj, and 
B in ([T 2 |, and ( |13| l, respectively. If Assumptions^and^ 
hold true, then the eigenvalues of matrices Hj, Dj, and B are 
uniformly bounded as 

ami A Ht A (2(1 - ,5) -f aM)I, (24) 

(2(1 - A)-f am)I ^ D* A {2{l - 5) + aM)l, (25) 

0 ^ B A 2(1-5)1. (26) 

Proof: See Appendix [B] ■ 

Proposition [T] states that Hessian matrix Hj and block diag¬ 
onal matrix Dt are positive dehnite, while matrix B is positive 
semidehnite. 

As we noted in Section |I^ for the expansion in |l4| l to 
be valid the eigenvalues of the matrix must 

be nonnegative and strictly smaller than 1. The following 
proposition states that this is true for all times t. 

Proposition 2 Consider the definitions of the matrices Dj in 
'Ell and B in ^13| ). If Assumptions^and^hold true, the matrix 
is positive semidefinite and its eigenvalues are 
bounded above by a constant p < 1 

0 ^ A pi, (27) 

where p := 2(1 — 5)/(2(l — 5) -I- am). 

Proof: See Appendix [C| ■ 

_ X /2 _ 1/2 

The bounds for the eigenvalues of matrix BD^ 

in ( |Z7| ) guarantee convergence of the Taylor series in ( [T^ . As 
mentioned in Section [111] NN-AT truncates the hrst K summands 
of the Hessian inverse Taylor series in d to approximate 
the Hessian inverse of the objective function in optimization 
problem ([^. To evaluate the performance of NN-AT we study 
the error of the Hessian inverse approximation by dehning the 
error matrix Ej S as 

Et := I - . (28) 

Error matrix E* measures closeness of the Hessian inverse 
approximation matrix H) ' and the exact Hessian inverse 
at time t. Based on the definition of error matrix E(, if 
the Hessian inverse approximation H) ’ approaches the exact 
Hessian inverse the error matrix Et approaches the zero 
matrix 0. We therefore bound the error of the Hessian inverse 
approximation by developing a bound for the eigenvalues of 
the error matrix Et. This bound is provided in the following 
proposition where we further show that the error of the Hessian 
inverse approximation for NN-AT decreases exponentially as we 
increases K. 


6 


Proposition 3 Consider the NN-K method as introduced in 
([T2|-([T7) and the definition of error matrix Ej in P8| ). Further, 
recall the definition of the constant p := 2(1 — 5) /{a + 2(1 — 
5)) < 1 in Proposition The error matrix Ej is positive 
semidefinite and all its eigenvalues are upper bounded by 

0 ^ Et ^ p^+H. (29) 

Proof: See Appendix [P] ■ 

Propositionj^asserts that the error in the approximation of the 
Hessian inverse, thereby on the approximation of the Newton 
step, is bounded by This result corroborates the intuition 

that the larger K is, the closer that approximates the 

Newton step. This closer approximation comes at the cost of 
increasing the communication cost of each descent iteration. 
The decrease of this error being proportional to p^^^ hints 
that using a small value of K should suffice in practice. This 
has been corroborated in numerical experiments where K = 1 
and K = 2 tend to work best - see |(3|. Further note that to 
decrease p we can increase 6 or increase a. Increasing 6 calls 
for assigning substantial weight to wa. Increasing a comes at 
the cost of moving the solution of (|^ away from the solution 
of Q and its equivalent Q. 

Bounds on the eigenvalues of the objective function Hessian 
Ht are central to the convergence analysis of Newton’s method 
[Chapter 9 of @]. Lower bounds for the Hessian eigenvalues 
guarantee that the matrix is nonsingular. Upper bounds imply 
that the minimum eigenvalue of the Hessian inverse 
is strictly larger than zero, which, in turn, implies a strict 
decrement in each Newton step. Analogous bounds for the 
eigenvalues of the NN approximate Hessian inverses H) ' 
are required. These bounds are studied in the following lemma. 

Lemma 2 Consider the NN-K method as defined in @- 
( |17| l. If Assumptions^and^hold true, the eigenvalues of the 
approximate Hessian inverse are bounded as 

AI A a AI, (30) 

where constants X and A are defined as 

. ^ 1 , . ^_ 1 - _ 

2[l-5) + aM ■ (l-p)(2(l-A) + am)' 

(31) 

Proof: See Appendix]^ ■ 

According to the result of Lemma 1^ the NN-iT approximate 
Hessian inverses HJ ’ are strictly positive definite and have 
all of their eigenvalues bounded between the positive and finite 
constants A and A. This is true for all K and uniform across 
all iteration indexes t. Considering these eigenvalue bounds 
and the fact that —gt is a descent direction, the approximate 
Newton step —H) ' gt enforces convergence of the iterate 
to the optimal argument y* of the penalized objective function 
F(y) in (|^. In the following theorem we show that if the 
stepsize e is properly chosen, the sequence of objective function 
values F{yt) converges at least linearly to the optimal objective 
function value F{y*). 


Theorem 1 Consider the NN-K method as defined in @-([171 
and the objective function F{y) as introduced in Further, 
recall the definitions of the lower and upper bounds A and 
A, respectively, for the eigenvalues of the approximate Hessian 
inverse Hf in (@. If the stepsize e is chosen as 


e = min 


3mA 2 


LA3(F(yo)-E(y*)) = 


(32) 


and Assumptions 00 and ^ hold true, the sequence F{y-t) 
converges to the optimal argument F{y*) at least linearly with 
constant 0 < 1 — ^ < 1 . I.e., 


F{yf) - F{y*) < (1 - C)‘(i"(yo) - F(y*)), (33) 


where the constant 0 < C < 1 ii explicitly given by 


^ := (2 — e)eamX 


ae^LA^F{yo) - F{y*))i 
6 Ai 


(34) 


Proof: See Appendix]^ ■ 

Theorem [T] shows that the objective function error sequence 
F{yt) — F{y*) asymptoticly converges to zero and that the 
rate of convergence is at least linear. Note that according to the 
definition of the convergence parameter in Theorem [T] and 
the definitions of A and A in ( |3T| l, increasing a leads to faster 
convergence. This observation verifies existence of a tradeoff 
between rate and accuracy of convergence. For large values 
of a the sequence generated by Network Newton converges 
faster to the optimal solution of (j^. These faster convergence 
comes at the cost of increasing the distance between the optimal 
solutions of (j^ and ([^. Conversely, smaller a implies smaller 
gap between the optimal solutions of and ([^, but the 
convergence rate of NN-AT is slower. This suggests value in 
the use of adaptive strategies for the selection of a that we 
develop in ||^. 


V. Conclusions 

This paper developed the network Newton method as an ap¬ 
proximate Newton method for solving distributed optimization 
problems where the components of the objective function are 
available at different nodes of a network. The algorithm builds 
on a reinterpretation of distributed gradient descent as a penalty 
method and relies on an approximation of the Newton step of 
the corresponding penalized objective function. To approximate 
the Newton direction we truncate the Taylor series of the exact 
Newton step. This leads to a family of methods defined by the 
number K of Taylor series terms kept in the approximation. 
When we keep K terms of the Taylor series, the method is 
called NN-AT and can be implemented through the aggregation 
of information in A'-hop neighborhoods. We showed that the 
proposed method converges at least linearly to the solution of 
the penalized objective, and, consequently, to a neighborhood 
of the optimal argument for the original optimization problem. 
It follows from this convergence analysis that larger penalty 
coefficients result in faster convergence that comes at the cost 
of increasing the distance between the optimal solutions of the 
original and penalized objectives. 

This paper does not show any advantage of NN relative to 
distributed gradient descent, other than the expectation to see 
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improved convergence times due to the attempt to approximate 
the Newton direction of the penalized objective. These advan¬ 
tages are shown in a companion paper where we: (i) Show that 
the convergence rate is quadratic in a specific interval that can 
be made arbitrarily large by increasing K ||^. (ii) Use numerical 
results to establish the advantages of NN-iT in terms of number 
of iterations and communications steps relative to DGD. 

Appendix A 
Proof of Lemma[T] 

Consider two vectors y := [xi;...;x„] G and y := 
[xi;... ;x„] G Based on the Hessian expression in ( fTO) !, 
we simplify the Euclidean norm of the Hessian difference 
H(y) - H(y) as 

||H(y)-H(y)|| =a||G(y)-G(y)||. (35) 

The result in ( |T5| l is implied by the fact that the matrix I — Z 
does not depend on the argument y of the Hessian H(y). The 
next step is to bound the norm of the difference for two G 
matrices ||G(y) — G(y)|| in terms of the difference between 
two vectors y and y, i.e. ||y — y]]. 

According to the definition of G(y) in ( [TT] i, the difference 
matrix G(y) — G(y) is block diagonal and the ith diagonal 
block is 

G(y),, - G(y),, = VV»(x*) - (36) 


Note now that for any sequences of scalars and bi, the in¬ 
equality '£7=1 < (Efci a?KEr=r holds. If we divide 

both sides of this relation by £ 7=1 “ ||xi — Xi|| 

and bi = ||vi||, we obtain 


ELll|x^-X.|l2l|vdP 




ELillv.P 

Combining the two inequalities in ( |40l l and ( |4T] ) leads to 

n 

l|G(y) - G(y)||^ < maxL^^ ||xi - x, 


'■*ll 2 ■ 


(41) 


(42) 


i=r 


Since the right hand side of ( |42| does not depend on the vector 
V we can eliminate maximization with respect v. Further, note 
that according to the structure of vectors y and y, we can 
write ||y — yj]^ = £7=i l|xi — These two observations in 
association with ( |42l l imply that squared norm of the difference 
between matrices G(y) and G(y) is bounded above by 


||G(y)-G(y)|l2<L2||y_y||2_ (43) 


Taking the square root of both sides of (|4^ implies 


liG(y)-G(y)|| 2 <L|ly-y|| 2 . (44) 

According to ( |44l i we can conclude that the matrix G is Lip- 
schitz continuos with parameter L. Considering the expression 
in (|T5l) and the inequality in (|44li, the claim in (|2^ follows. 


Consider any vector v G and separate each p components 
of vector v and consider it as a new vector called G 
i.e. V := [vi;...; v„]. Observing the relation for the difference 
G(y) — G(y) in ( |3^ , the symmetry of matrices G(y) and 
G(y), and the dehnition ||A ||2 VAmaa:(A^A) of the Eu¬ 

clidean norm of a matrix, we obtain that the squared difference 
norm ||G(y) — G(y)||| can be written as 


l|G(y)-G(y)|| 2 = max- 


lG(y)-G(y)]^ 


= max 


T77=i [vVi(x*) - v2/*(xi)]' 


(37) 


Observe that each summand in ( |J7l l can be upper bounded by 
applying Cauchy-Schwarz inequality as 

vf [V^/*(x*)-V^/i(xi)]^Vi< ||vVi(x*)-V^/i(xj)||2||v,||^ 

(38) 

Substituting the upper bound in ( |38] l into ( |T7| ) implies that the 
squared norm ||G(y) — G(y )||2 is bounded above as 


I 

G(y)-G(y)||2 < max—7^ 


V 2 /.(x.)-V 2 /.(* 0 | 


(39) 

Observe that Assumption 2 states that the local objective func¬ 
tion Hessians V^/i(xi) are Lipschitz continuous with parameter 
L, i.e., ||V2/j(xi)-V2/i(xi)|| < L\\xi--ki\\. Considering this 
inequality the upper bound in ( |39| ) can be changed by replacing 
||V^/i(xi) - V 2 /*(xj)|| by L\\xi - xj which yields 


|G(y) - G(y)||2 < max 


l^j:7=i 


Xj — X,; 


E Jl 


(40) 


Appendix B 

Proof of Proposition[T] 

We first study the bounds for the eigenvalues of matrix I — Z. 
Notice that since I — Z can be written as (I„ — W) 0 Ip, all 
the eigenvalues of matrix I — Z are in the spectrum of matrix 
I — W. Therefore, we can study the bounds for the eigenvalues 
of matrix I — W in lieu of matrix I — Z. The Gershgorin 
circle theorem states that each eigenvalue of a matrix A lies 
within at least one of the Gershgorin discs D^aa^Ru) where 
the center an is the ith diagonal element of A and the radius 
Rii •= '^he sum of the absolute values of all 

the non-diagonal elements of the ith row. Note that matrix 
I — W is symmetric and as a result all eigenvalues are real. 
Hence, Gershgorin discs can be considered as intervals of width 
[an — Rii, an + Rn] for matrix I — W, where an — ^ — wn 
and Rn = £j^i\'uiij\- Since all the elements of matrix W 
are non-negative, \wij\ can be substituted by Wij. Therefore, 
all the eigenvalues of matrix I — W in at least one of the 
intervals [l-Wi-£-^-w^j, l-w^+£-^^ w^j]. Now observing 
that sum of the weights that a node assigns to itself and all 
the other nodes is one, i.e. £j Wij = 1, it can be derived 
that 1 — Wn = observation implies that the 

Gershgorin intervals can be simplified as [0, 2(1 — wn)] for 
i = 1,... ,n. This observation in association with the fact that 
2(1 — Wn) < 2(1 — (5) implies that all the eigenvalues of matrix 
I — W are in the interval [0, 2(1 — (5)] and consequently the 
eigenvalues of matrix I — Z are bounded as 

0 A I-Z A 2(1-5)1. (45) 

To prove bounds for the eigenvalues of Hessian Hj, first we 
hnd lower and upper bounds for the eigenvalues of matrix G*. 









Since matrix Gj is block diagonal and the eigenvalues of each 
diagonal block Gu^t — V^/i(xi () are bounded by constants 
0 <m<M<ooas mentioned in ( | 2 T| l, we obtain that the 
eigenvalues of matrix G( are bounded as 

ml ^ Gt ^ Ml. (46) 

Considering the dehnition of Hessian Ht := I — Z + aGt and 
the bounds in ( |45| ) and ( |46| l, the hrst claim follows. 

We proceed now to prove bounds for the eigenvalues of block 
diagonal matrix D(. According to the dehnition of matrix D* 
in © we can write 

Bt = aGt + (I„ - Wd) 0 Ip , (47) 

where is dehned as := diag(W). Note that matrix 
In-Wd is diagonal and the i-th diagonal component is l — wa. 
Since the local weights satisfy 6 < wu < A, we obtain that 
eigenvalues of matrix I„ — are bounded below and above 
by 1 — A and 1 — <5, respectively. Observe that the eigenvalue 
sets of matrices (I„ — W^;) and (I„ — W^) 0 Ip are identical 
which implies 

(l-A)I„p ^ (I„-Wd)(g)Ip ^ {l-<5)I„p (48) 

Considering the relation in ( |47| l and bounds in ( |46l l and ( |48] l, 
the second claim follows. 

Based on the dehnition of matrix B in ( [T3l l and the relation 
that Z = W C) I we can write 

B = (I - 2Wd + W) O I. (49) 

Hence, to bound eigenvalues of matrix B we study lower and 
upper bounds for the eigenvalues of matrix I — 2Wd + W. 
Observe that in the z-th row of matrix I — 2Wd + W, the 
diagonal component is 1 — wu and the jth component is Wij 
for all j i. Using Gershgorin theorem and the same argument 
that we established for the eigenvalues of I — Z, we can write 

0 A I-2Wd + W A 2(1-5)1. (50) 

Considering ( |50l l and the expression for matrix B in ( |49] l, the 
last claim follows. 


Appendix C 

Proof of Proposition!!] 

According to the result of Proposition 1, the block diagonal 
matrix Dt is positive dehnite and matrix B is positive semidef- 
inite which immediately implies that matrix ' BD^ ' is 
positive semidehnite and the lower bound in ( |27j l follows. 

Recall the dehnition of block diagonal matrix D( in ( fT 2 | ) and 
dehne matrix D as a special case of matrix Dj for a = 0. 
I.e., D := 2(1 — Z^). Notice that matrix D is diagonal, only 
depends on the structure of the network, and that it is also 
time invariant. Since matrix D is diagonal and each diagonal 
component 1 — wa is strictly larger than 0, matrix D is positive 
definite and invertible. Therefore, we can write ' BD^ ' 
as 

(d-5BD-5^ . 

(51) 

The next step is to hnd an upper bound for the eigenvalues of the 
symmetric term D~^/^BD~^/^ in Observing the fact that 


matrices D- 1 / 2 BD- 1/2 

and BD ^ are similar, eigenvalues of 
these matrices are identical. Therefore, we proceed to character¬ 
ize an upper bound for the eigenvalues of matrix BD~^. Based 
on the dehnitions of matrices B and D, the product BD ^ is 
given by 

BD-i = (I-2Zd + Z)(2(I-Zd))-i. (52) 


Therefore, the general form of matrix BD’ ^ is 


BD-i = 


1 

2 


I 

W21 T 
( 1 -uiii) 


__wi2_r 

{1-1022)'’' 

I 


Win r 

(l-Wnn)^ 
W2n r 
(l-Wnn)^ 


■ (53) 


W„1 T W„2 T 

(l-tuii)-*- {I-W22)’' 


I 


Note that each diagonal component of matrix BD ^ is 1/2 and 
that the sum of non-diagonal components of column i is 


E bd-Mj*] 

1 = 1 


1 

2 


E 






1 - Wi, 


1 

2 ' 


(54) 


Now, by considering the result in ( |54l l and applying Gershgorin 
theorem we can conclude that eigenvalues of matrix BD^^ are 
bounded as 


0<^i(BD-i)<l i = (55) 

where /Xi(BD~^) indicates the z-th eigenvalue of matrix 
BD~^. The bounds in ( |55] l and similarity of matrices BD~^ 
and D~^/^BD“^/^ show that the eigenvalues of matrix 
D-U2BD-1/2 

are uniformly bounded in the interval 

0 < ^i(D-i/2BD-i/2) < (56) 


Based on the decomposition in (jSTjl to characterize the 
bounds for the eigenvalues of matrix Dj ' BD^ ' , the 
bounds for the eigenvalues of matrix D^/^D^ should be 
studied as well. Notice that according to the dehnitions of 
matrices D and Dj, the product D^/^D^ is block diagonal 
and the z-th diagonal block is 


; 51 / 2 d- 1/2 




( QVVz(x»,t) \ 

'v 2(1-wu) ) 


(57) 


Observe that according to Assumption 1, the eigenvalues of 
local Hessian matrices V^/i(xi) are bounded by m and M. 
Further notice that the diagonal elements of weight matrix wu 
are bounded by 6 and A, i.e. S < wu < A. Considering these 
bounds we can show that the eigenvalues of matrices (a/ 2 (l — 
iPii))V^/i(xi.t) -P I are lower and upper bounded as 


am 


2(1-5) 


+ 1 


I A 


aV^/»(x,,t) 
2(1 - zuii) 


I A 


aM 


[2(1-A) 


+ 1 


I. 

^(58) 


By considering the bounds in ( |58| l, the eigenvalues of each block 
of matrix D^/^D^ as introduced in ( |57| ) are bounded below 
and above as 


2(1-A) 


2 ( 1 -A)-PaM 


I A 


aV^fi(xid) 

2(1 - zuii) 


A 


2(1-5) 


2(1 — 5) -p I 


(59) 


I. 
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Since ( |59] l holds for all the diagonal blocks of matrix 
the eigenvalues of this matrix also satisfy the 
bounds in (|59ll which implies that 


2(1-A) 


2(1 - A) +aM 


< ft. 


(d5D, 


< 


2(1-5) 


2(1 — 5) + I 


(60) 


for i = 1,... ,n. Observing the decomposition in dSTh, the norm 

_ ]^/2 _ 1/2 ' ' 

of the matrix ' BD^ ' is upper bounded as 




- 1/2 




D-1/2BD-1/2 


- 1/2 

Considering the symmetry of matrices ' and 

I)-i/ 2 bd~i/ 2 ^ and the upper bounds for their eigenvalues 
in ( |5^ and ( |60l l, respectively, we can substitute the norm of 
these two matrices by the upper bounds of their eigenvalues 
and simplify the upper bound in (|M]l to 


d; 


- 1/2 


BD 


- 1/2 


< 


2 ( 1 -^) 
2(1-5)-fa 


(62) 


Based on the upper bound for the norm of the matrix 
Dj in ( [62l i and the fact that matrix Dj 

is positive semidefinite, we can conclude that the eigen- 
values of matrix ' BDj ' are upper bounded by 
2(1 — 5)/(2(l — 5)-f am) and the right hand side of ( |27| ) 
follows. 


Appendix D 

Proof of Proposition[3] 

In this proof and the rest of the proofs we denote the Hessian 
approximation as instead of for simplification of 

equations. To prove lower and upper bounds for the eigenvalues 
of the error matrix E( we first develop a simplification for the 
matrix in the following lemma. 


Lemma 3 Consider the NN-K method as defined in 
The matrix I — can be simplified as 

I-H*Hr^ = (BDr^)^+\ (63) 

Proof : Considering the definitions of the Hessian inverse 
approximation in ( [T5] l and the matrix decomposition for 
the exact Hessian Hj = Dj — B, we obtain 

= (D* -B) (dp ^ (dPBDP)' dp ) . 

\ fc -0 / 

K 

= (D*-B)^DrpBD,-')". (64) 

By considering the result in ( [64| ), we simplify the expression 
as 

K 

I - H*H-i = I - (D, - B) y] D-i (BD-i)" 

= I-f;{BD,-‘)'‘ + f;{BD,-■)*«, (65) 

fc =0 fe =0 


In the right hand side of ( |65] > the identity matrix cancels out 
the first term in the sum (BD^"^) . The remaining terms 

of this sum are cancelled out by the first K terms of the sum 
Ef=o (BDP)^"^ so that the whole expression simplifies to 
(BD('^)^+^ as is claimed in ( |6^ . ■ 

Observing the fact that the error matrix Ej is a conjugate of 
the matrix I — and considering the simplification in 

Lemma we show that the eigenvalues of error matrix E( are 
bounded. 

Proof of Proposition |3l Recall the result of Proposition 121 
that all the eigenvalues of matrix D^ BD^ are uniformly 
bounded between 0 and p. Since matrices D^^^^BD)"^^^ and 
BtD, 1 are similar (conjugate) the sets of eigenvalues of these 
two matrices are identical. Therefore, eigenvalues of matrix 
BD-i are bounded as 

0 < /ri(BD^^) < p, (66) 

for i = 1,2,..., np. The bounds for the eigenvalues of matrix 
BD 1 in association with expression ( |6^ leads to the following 
bounds for the eigenvalues of matrix I — 

0 < p,(I-HtH,-i) < p^+\ (67) 

Observe that the error matrix Et = I — HjHj is the 

conjugate of matrix I — Hence, the bounds for the 

eigenvalues of matrix I —also hold for the eigenvalues 
of error matrix Ej and the claim in (|29l) follows. 


Appendix E 
Proof of Lemma|2] 


According to the Cauchy-Schwarz inequality, the product of 
the norms is larger than norm of the products. This observation 
and the definition of the approximate Hessian inverse in 
([B]) leads to 



< 


d: 


- 1/2 






1 K 


( 68 ) 


Observe that as a result of Proposition [T] the eigenvalues of 
matrix Dt are bounded below by 2(1 — A) -f am. Therefore, 
the maximum eigenvalue of its inverse D^"^ is smaller than 
1/(2(1 — A) +am). It then follows that the norm of the matrix 

_X /2 

T>^ " is bounded above as 



< 

1 



2(1 — A) -f am 


(69) 


Based on the result in Proposition the eigenvalues of the ma¬ 
trix D 7 ^/^BD 7 are smaller than p. Further using the sym- 
metry and positive definiteness of the matrix D^ ' BD^ ' , 


we obtain 




< P- 


(70) 


Using the triangle inequality in ( [ 68 ] l to claim that the norm of 
the sum is smaller than the sum of the norms and substituting 
the upper bounds in (|69|) and (|70| in the resulting expression 
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we obtain 



< 


1 

2(1 — A) + am 




(71) 


By considering the fact that p is smaller than 1, the sum 
J2k=o simplified to (1 — p^+^)/(l — p). Considering 

this simplification for the sum in ( |7T] l, the upper bound in (|30l ) 
for the eigenvalues of the approximate Hessian inverse 
follows. 

The next step is to provide a lower bound for the eigenvalues 
of the Hessian inverse approximation matrix In the 

Hessian inverse approximation formula ( [T5| ), all the summands 
except the first one, D^”^, are positive semidefinite. Hence, the 
approximate Hessian inverse is the sum of matrix 
and K positive semidefinite matrices and as a result we can 
conclude that 

D*-' A (72) 


Proposition [T] shows that the eigenvalues of matrix Dt are 
bounded above by 2(1 —(5)+aM which leads to the conclusion 
that there exits a lower bound for the eigenvalues of matrix 


D: 


1 

2(1-^) + q;M 


I ^ D, 


(73) 


Observing the relation in ( |72l l we realize that the lower bound 
for the eigenvalues of matrix in ( |7^ holds for the eigen¬ 
values of the Hessian inverse approximation H("^. Therefore, 
all the eigenvalues of the Hessian inverse approximation 
are greater than 1/(2(1 — (5) -f aM). This completes the proof 
of the claim in (|30ll. 


Appendix F 
Proof of Theorem [T] 

To prove global convergence of the Network Newton method 
we first introduce two technical lemmas. In the first lemma we 
use the result of Lemma namely, that the objective function 
Hessian V^F(y) is Lipschitz continuous, to develop an upper 
bound for the objective function value F{y) using the first three 
terms of its Taylor expansion. In the second lemma we construct 
an upper bound for the objective function error at step t + 1, 
namely F{yt+i) — F{y*), in terms of the error at step t, namely 
Fiyt)-F{y*). 

Lemma 4 Consider the objective function F{y) as defined in 
If Assumptions 2 and^hold true, then for any y,y G 
the following relation holds 

Fiy)<F{y)+VF{yf{y-y) (74) 

+ ^(y - yf V^7"(y)(y - y) + ^||y - y||3. 

Proof: Since objective function F is twice differentiable, based 
on the Fundamental Theorem of Calculus we can write 

F{y) = F{y)+[ VF(y + w(y-y))'^(y-y) dw, (75) 
Jo 

where VF is the gradient of function F. We proceed by adding 
and subtracting the term VF(y)^(y —y) to the right hand side 


of ( |75| l which yields 

F{y)=F{y)+VF{yf{y-y) (76) 

pi 

+ [VF(y + u;(y-y)) - VF(y)]^(y-y) dw. 

Jo 

We apply again the Fundamental Theorem of Calculus but for 
the gradient VF. It follows that for any vectors z,z G we 
can write 

VF(z) = VF(z) + /" W^F{z + s{z-z)){z-z) ds, (77) 
Jo 

where V^F is the Hessian of function F. After setting z = 
y -b a;(y — y) and z = y in ( fTT] ) and rearranging terms it 
follows that 


VF(y + a;(y-y))-VF(y) = (78) 

/ V2F(y + s(y + a;(y - y) - y))(y + a;(y - y) - y) ds. 


Jo 

Observing the fact that y + uj{y — y) — y = oj{y — y), we can 
further simplify (|78]l to 


VF(y + w(y - y)) - VF(y) = (79) 

[ V^F(y + suj{y - y))uj{y - y) ds. 
Jo 


Based on the relation for the difference of gradients VF(y -h 
w(y - y)) - VF(y) in we can rewrite ( |76l l by applying 
this substitution. This yields 


7^(y) = 7^(y) + VF(y)^(y-y) 


(80) 


w(y-y) V F{y + suj{y-y)){y-y)dsduj. 


0 Jo 


We proceed by adding and subtracting the quadratic integral 
flfo uj(y — y)^V‘^F(y)(y — y)dsduj to the right hand side of 
( |80l ) to write 

7"(y)=7"(y) + VF(y)^(y-y) 

+ [ [ uj{y-y)'^y/^F{y){y-y) ds duj 

Jo Jo 

(•1 /•! r 

^^(y - y)^ [v2F(y + suj{y - y)) - V^F{y) 
{y-y)dsduj. (81) 


0 Jo 


Observe that the term (y — y)^V^F(y)(y — y) in the third 
summand of ( |8T] l is not a function of w or s. Hence, we can 
move this term outside of the integral and simplify the integral 
to fo fo oj dsdu! = 1/2. As a result of these observation 
the third summand of can be replaced by (l/2)(y — 
y)^V^F(y)(y — y) and we can rewrite (|8T]) as 


Fif) = F(y)+VF(y)^(y - y) + ^(y - y)^V2F(y)(y - y) 



w(y-y)^ V^F(y + sa;(y-y)) 
(y - y) ds duj. 


V2F(y) 

(82) 


We proceed now to construct an upper bound for the integral in 
(|82|). Observe that according to the definition of the Euclidean 
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norm of a matrix we have the inequality (y — y)^ [V^i^(y + 
sw(y - y)) - V 2 F(y)] (y - y) < || V 2 F(y + sw(y - y)) - 
V^F(y)|j 2 ||y — y|p. By applying this inequality we have an 
upper bound for the integral in (|82li that results in 


of two consecutive variables as yt+i — y* = —g*. Making 
this substitution in (| 88 ll implies 


F{yt+i) < Fiyt) - egi U^^gt + -g*^ 


F{y)<Fiy) + VFiyf{y - y) + i(y - y)^V"F(y)(y - y) 


+ 


|y-y|: 


’[ f w||V^F(y + sa;(y-y))-V^F(y)|| 2 dsda;. 
Jo Jo 

(83) 


aLe^ . 


Hr'g*i 


According to the definition of error matrix in 


can substitute H 


- 1/2 


HtH* 


- 1/2 


(89) 


we 


by I — Ej. By making this 


substitution into the third summand of 


we obtain 


The next step is to provide an upper bound for the term 
II V^F(y+sw(y—y))—V^E(y )||2 in the right hand side of ( [83] l. 
Lemma [T] shows that the penalized objective function Hessian 
is Lipschitz continuous with parameter aL. Therefore, we 
can write 

||V^F(y+sa;(y-y))-V^F(y)|| <aL||y+sa;(y- y)-y|| 


i"(yt+i)<i"(yt)-eg/H,-'gi+-g/H, ^(I-EOH, "g* 


6 


lHr'g*f. 


(90) 


Proposition shows that the error matrix E( is always positive 
semidefinite. As a result, we conclude that the quadratic form 


= aLsu;||y-y||. (84) g* ^'^gt is always nonnegative. Considering this 


-1/2, 


By considering ( [ 8 ^ and substituting || V^F(y + sw(y — y)) — 
V^F(y )||2 by the upper bound in (|84ll, we obtain 


f f aLsw^lly — y|| dsduj. (85) 
Jo Jo 

Now observe that since ||y — y|| does not depend on s or w, 
the integral in the last summand of (|85|) can be simplified as 


lly-y|| 


n aLstu^lly — y|| dsdw = aL||y — y|| / / dsdu: 

Jo Jo 

= ^l|y-y||. ( 86 ) 

The simplification in ( | 86 | ) for the last summand of ( |85] l implies 
the claim in cn is valid. ■ 

Lemma shows an upper bound for the Taylor expansion of 
the objective function value F{y). We use the result of Lemma 
l^to establish an upper bound for the objective function error at 
step f + 1 in terms of the error at step t. This result is proven 
in the following lemma. 

Lemma 5 Consider the NN-K method as defined in 
and the objective function F{y) as defined in Further, 
recall the definition of y* as the optimal argument of the 
objective function F{y). If assumptions 01 and 1^ hold true, 
the sequence of objective function value errors {F(yt)—F(y*)} 
satisfies 

F{yt+i) - F{y*) < [l - (2e - e^) amX] [F(yt) - F(y*)] 

+ (87) 

6 A 2 

Proof: Recall the result of LemmaBy setting y := yt+i and 
y := yt in @ we obtain 


Fiyt+i) < F{yt) + gf (yt+i - yt) 


( 88 ) 


+ \ (yt+i -yt)’^Ht (yt+i -yt)+^ ||yt+i -yt f, 

where gt := VF(yt) and Ht := V^F(yt). From the definition 
of the NN-AT update formula in ([T6]l we can write the difference 


lower bound we can simplify 
( 2 e - e 2 ) 


to 


aLe^ , 


F{y) < F{y) + VF(y)^(y - y) + -(y - yfV^Fiy){y - y) 


gfHr'gt + -^llHr'gtf. 


7^(yt+i) < F(yt) - 

/ t» 

(91) 

Note that since the stepsize is not larger than 1, we obtain that 
2e — is positive. Moreover, recall the result of Lemma that 
all the eigenvalues of the Hessian inverse approximation 
are lower and upper bounded by A and A, respectively. These 
two observations imply that we can replace the term gfliif^g-t 
by its lower bound A||gt|p. Moreover, existence of upper bound 
A for the eigenvalues of Hessian inverse approximation 
implies that the term ||Hj"^g(|p is upper bounded by A^||gt||^. 
Substituting these bounds for the second and third terms of ( |9T] l 
and subtracting the optimal objective function value F{y*) from 
both sides of inequality (|9^ leads to 

F{yt+i) - Fiy*) < F(y,) - F{y*) - 


aLe^A^, 

(T 


|gtP 


(92) 


We now find lower and upper bounds for the norm of gradient 
|jg(|j in terms of the objective function error F{yt) — F{y*). 
As it follows from Proposition [T] the eigenvalues of Hessian H* 
are bounded by am and 2 + aM. Taking a Taylor expansion 
of the objective function F{y) around w and using the lower 
bound am for the Hessian eigenvalues yields 


F(y) > F(w) + VF(w)^(y - w) + — |ly - ■ 


(93) 


For fixed w, the right hand side of ( [9^ is a quadratic function of 
y whose minimum argument we can find by setting its gradient 
to zero. Doing this yields the minimizing argument y = w — 
(l/m)VF(w) implying that for all y we must have 


F{y) > F(w) + VF(w)^(y - w) + 

= - y^||VF(w)f. 

Zam 


lly-w|| 


(94) 


The bound in ( |94| ) is true for all w and y. In particular, for 
y = y* and w = yt ( |94| yields 

1 


F{y*) > F{yt) - 


Zam 


|VF(yt 


(95) 
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Rearrange terms in ( |95| l to obtain 2am{F{yt) — F{y*)) as a 
lower bound for |jVF(yt)|p = ligtP- Now substitute the lower 
bound 2am{F{yt)—F{y*)) for squared norm of gradient ||gt|p 
in the second summand of (|92]i to obtain 


F(yt+i)-F(y*)< [l-(2e- e^) amX]{F{yt)-F{y*)) 


+ --- 




(96) 


To find an upper bound for the norm of gradient ||gt|| in terms 
of the objective function error F{yt) — F{y*) we first use the 
Taylor expansion of the objective function F{y) around w by 
considering the fact that 2(1 — S) + aM is an upper bound for 
the eigenvalues of the Hessian. For any vectors y and y in 
we can write 


F{y) < F{y) + VF(y)^(y - y) + 2(1 ^) + ^^ ||y _ 

(97) 

Notice that according to the dehnition of A in iu we can 
substitute 2{l — S) + aM by 1/A. Implementing this substitution 
and minimizing both sides of the equality with respect to y 
yields 

Fiyn< F(y)-A|!VF(y)f. (98) 


Setting y = yt, observing that by dehnition || VJ^(yt)|| = ||gt||, 
rearranging terms, and taking the square root of both sides of 
the resulting inequality leads to 


llgtll < 


1 

l[F{yt)-F{y*)] ' 


(99) 


Replacing the upper bound in ( |9^ for the norm of the gradient 
ligtil in the last term of (|96l) yields the claim in (|87]i. ■ 


We use the result of Lemma to prove linear convergence of 
the sequence of objective function errors F{yt) — F{y*) to zero. 


Proof of Theorem To simplify upcoming derivations 
dehne the sequence /3t as 


Pt '■= (2 — e)eamX 


e^aLA^ [Fjyt)-F{y*)]"^ 
6Xi 


( 100 ) 


Recall the result of LemmaFactorizing F(yt) — F{y*) from 
the terms of the right hand side of ( |87| ) in association with the 
dehnition of Pt in (|100|l implies that we can simplify ([87ll as 


F{yt+i) - F{y*) < (1 - Pt){F{yt) - F{y*)). (101) 


To prove global convergence of objective function error F{yt) — 
F{y*) we need to show that for all time steps t, the constants Pt 
are strictly smaller than 1 and larger than 0, i.e., that 0 < A < 1 
for all times t. 

We hrst show that Pt is less than 1 for all f > 0. To do so 
observe that the second term in the right and side of ( |100| ) is 
nonnegative. It is therefore true that 


/3t < (2 — e)eamX. (102) 

Considering the inequality (e — 1)^ > 0 it is trivial to derive 
that e(2 — e) < 1. Moreover, considering the facts that m < M 
and 1 — (5 > 0, we obtain am < aM + (1 — i5) which yields 
am/{aM + 2(1 — 6)) < 1. Considering the dehnition of A 
in iB we can substitute 1/(2(1 — (5) + aM) by A and write 


amX < 1. By multiplying these two ratios, both of which are 
smaller than 1, we conclude that 


(2 - €)eamX < 1. (103) 


That /?t < 1 follows by combining ( |102| i with ( |103| l. 

To prove that 0 < pt for all f > 0 we prove that this is true 
for t = 0 and then prove that the Pt sequence is increasing. 
To show that Pq is positive hrst note that since the stepsize e 
satishes the condition in ([3^ we can write 


e < 


3toA2 ^ 

LA3(F(yo)-F(y*))i ’ 


(104) 


By computing the squares of both sides of ( |104| ), multiplying 
the right hand side of the resulting inequality by 2 to make 
the inequality strict, and factorizing amX from the term in the 
resulting right hand side we obtain 


e 2 < 


6 AS 

aLA^[F{yo) - F{y*)]i 


X amX. 


(105) 


If we now divide both sides of the inequality in ( |105| l by the 
hrst multiplicand in the right hand side of (|105|) we obtain 


e‘^aLA^[F{yo) - F(y*)]2 

6 Ai 


< amX. 


(106) 


Observe that based on the hypothesis in ( |3^ the step size e is 
smaller than 1 and it is then trivially true that 2 — e > 1. This 
observation shows that if we multiply the right hand side of 
( |106| l by 2(1 — e/2) the inequality still holds. 


e'^aLA^{F{yo) - F{y*))i ^ ^ ^ 

■— - 3 --—^—- < am{2 — e)X. 

6 A 2 

Furhter multiplying both sides of inequality ( |107| i by 
rearranging terms leads to 


(107) 
e and 


ame{2 — e)A 


e^aLA^[F{yo) - F{y*)]"2 

6 Ai 


(108) 


According to the dehnition of Pt in ( |100| ), the result in ( |108| l 
implies that Pq > 0. 

Observing that Pq is positive, to show that for all t the 
sequence of Pt is positive it is sufficient to prove that the 
sequence Pt is increasing, i.e., that Pt < Pt+i for all t. We 
use strong induction to prove Pt < Pt+i for all t > 0. By 
setting t = 0 in (|101|i the inequality can be written as 


F{yi) - F{y*) < (1 - /3o)(F(yo) - F(y*)). (109) 


Considering the result in ( |109| l and the fact that 0 < /3o < 1, we 
obtain that the objective function error at time f = 1 is strictly 
smaller than the error at time t = 0, i.e. 


F{yi)-F{y*) <F{yo)-F{y*). (110) 

Observe now that in the dehnition of sequence Pt in ( |100| l 
the objective function error term F{yt) — F{y*) appears in 
the numerator of negative term. Therefore, a smaller objective 
function error F{yt) — F{y*) leads to a larger coefficient Pt- 
Hence, this observation in association with the result in ( |110| i 
leads to the conclusion. 


Po < Pi- 


( 111 ) 
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To complete the strong induction argument assume now that 
/3o < /3i < • • • < Pt-i < Pt and proceed to prove that if this is 
true we must have /3t < Pt+i - Begin by observing that since 0 < 
/3o the induction hypothesis implies that for all u € {0 ,..., f} 
the constant /3„ is also positive, i.e., 0 < Pu- Further recall that 
for all t the sequence Pt is also smaller than 1 as already proved. 
Combining these two observations we can conclude that 0 < 
Pu < ^ for all M G {0,..., f}. Consider now the inequality in 
( | 101 | l and utilize the fact that 0 < /3„ < 1 for all u G {0 ,..., f} 
to conclude that 

F(y„+i)-i^(y*) <F^(y„)-F(y*), ( 112 ) 

for all u G {0, ... ,t}. Setting u = t in ( |112| ) we conclude that 
F{yt+i) - F(y*) < F {yt) - F{y*). By further repeating the 
argument leading from to ( | 110 [ ) we can conclude that 

Pt<Pt+i- (113) 

The strong induction proof is complete and we can now claim 
that for all times t 

0<Po<Pi<---<Pt<l. (114) 

The relationship in ( |101| i and the property in ( |114| i imply 
convergence of the objective function value sequence to the 
optimal argument, i.e. limt_>,oo F{yt) — F{y*) = 0. To conclude 
that the convergence rate is at least linear simply observe that if 
the sequence Pt is increasing as per ( |114| ), the sequence 1 — Pt 
is decreasing and satisfies 

0< 1-^t < l-/3o < 1, (115) 

for all time steps t. Applying the inequality in ( |101| i recursively 
and considering the inequality in ( |115| l yields 

Fiyt) - Fiy*) < (1 - Po)\F{yo) - F(y*)), (116) 

which shows the objective function error sequence F{yt) — 
F{y*) converges to 0 at least linearly with constant (1 — /3o). 
By setting C = Pq, the claim in ( |3^ follows. 
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